Generalized Linear Models I

Chelsea Parlett-Pelleriti

Linear Model

Linear Regression

\[ \mathbb{E}\left( y | x\right) = \mu = \mathbf{X}\beta \]

Linear Regression Assumptions

  • independently and identically distributed

  • linearity

  • normally distributed errors

  • homoskedasticity

I.I.D.

Data points are

  • independent: knowing the value of one observation doesn’t give you information about other observations

  • identically distributed: all observations are from the same underlying distribution with constant variance

❓ can you think of kinds of data that might violate this assumption?

I.I.D.

Risks with (unaccounted for) non-iid data:

  • biased standard error estimates

  • inflated error rates

  • inefficient estimates (don’t take into account dependence)

  • invalid p-values

Essentially…you have the WRONG model.

Linearity + Additivity

  • Linearity: As a predictor (\(x_i\)) increases by one unit, our prediction changes by a constant amount.

❓ can you think of an example where this might be a bad assumption?

Linearity + Additivity

  • Additivity: The effect of one predictor (\(x_i\)) does not depend on the values of the other predictors.

❓ can you think of an example where this might be a bad assumption?

Linearity + Additivity

  • Linearity: As a predictor (\(x_i\)) increases by one unit, our prediction changes by a constant amount.

  • Additivity: The effect of one predictor (\(x_i\)) does not depend on the values of the other predictors.

💡 Luckily there are some neat tricks to get around these while not technically violating them.

Normality (meh)

For basic linear regression we assume errors are distributed normally around the regression line.

  • no heavy tails (errors are more and more uncommon as they get larger)

  • no skewness (we’re not more likely to guess too high than too low)

Homoskedasticity

No matter what the predicted value is, the error variance is constant.

Preview

Don’t worry, the rest of this module is about breaking these assumptions.

Linear Regression Fitting

  • Least Squares

  • Maximum Likelihood Estimation

Least Squares

Least Squares

As the name implies, Least Squares aims to choose parameter values that *minimize* the sum of squared errors.

\[ \text{SSE} = \sum_{i = 1}^n(y_i - \beta_0 - \beta_1*x_i)^2 \]

SSE is one choice of a loss function.

Least Squares

Loss Function: a function that quantifies how “correct” the output of the model is. Lower values mean better model performance.

❓ why use a squared error rather than an absolute error?

Least Squares

Calculate the partial derivatives of the loss function w.r.t. each parameter we can change (\(\beta_0\) and \(\beta_1\) here).

\[\frac{\partial SSE}{\partial \beta_0} = \sum_{i = 1}^n 2(y_i - \beta_0 - \beta_1*x_i)(-1)\] \[\frac{\partial SSE}{\partial \beta_1} = \sum_{i = 1}^n 2(y_i - \beta_0 - \beta_1*x_i)(-x_i)\]

Least Squares

and set them equal to 0.

\[\frac{\partial SSE}{\partial \beta_0} = \sum_{i = 1}^n 2(y_i - \beta_0 - \beta_1*x_i)(-1) = 0\] \[\frac{\partial SSE}{\partial \beta_1} = \sum_{i = 1}^n 2(y_i - \beta_0 - \beta_1*x_i)(-x_i) = 0\]

❓ why do we set the derivatives equal to 0?

Least Squares

then we solve for \(\beta_0\) and \(\beta_1\) and we get:

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]

and

\[ \hat{\beta}_1 = \frac{Cov(x,y)}{Var(x)} = Corr(x,y) * \frac{sd(y)}{sd(x)} \]

Least Squares

These values of \(\beta_0\) and \(\beta_1\) are the ones that minimize our Sum of Squared Errors (\(\text{SSE}\)), and therefore give us the best-fitting line under squared-error loss.

💡 notice, we solve for the optimal values directly in this case, however we can also solve these iteratively.
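To make the closed-form solution concrete, here is a quick numerical check (a sketch in Python/NumPy rather than R; the simulated data and variable names are illustrative):

```python
import numpy as np

# simulate data from a known line: y = 2 + 1.5x + noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=200)

# closed-form least-squares estimates from the slides
beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()

# compare against NumPy's least-squares polynomial fit
b1_ref, b0_ref = np.polyfit(x, y, deg=1)
print(beta0, beta1)  # close to 2.0 and 1.5
```

Both routes land on the same coefficients, since `np.polyfit` with `deg=1` minimizes the same SSE.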

Regression Coefficients

Once we solve for the coefficients, we can use them to understand relationships between variables.

  • \(\beta_0\): when all the predictors are 0, what is the predicted value?

    • this is why mean centering variables is helpful, it makes 0 a meaningful number
  • \(\beta_1\) (and others): when the predictor increases by 1 unit, how much does the predicted value change?

    • z-scoring/scaling helps put predictors on a similar scale, but it makes interpretation less straightforward

Regression Coefficients

  • \(\beta_n\) for a continuous variable: an increase of 1 unit in \(x\) results in a \(\beta_n\) change in the predicted value.

  • \(\beta_n\) for a categorical variable: being in the current (non-baseline) category results in a \(\beta_n\) change in the predicted value.

Regression Coefficients

💡 Remember: the intercept is the predicted value *when all \(x\)s are equal to 0*.

  • for continuous values: this is an actual 0, or, if z-scored, 0 represents the average/mean

  • for categorical values: 0 represents all indicator variables being 0, which is the reference category
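A small sketch of why mean centering makes the intercept meaningful (Python/NumPy with simulated data; names are illustrative): after centering, the intercept is exactly the sample mean of \(y\).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, size=500)    # predictor whose values are far from 0
y = 3.0 + 0.8 * x + rng.normal(size=500)

# mean-center the predictor so that x = 0 means "average x"
xc = x - x.mean()
slope, intercept = np.polyfit(xc, y, deg=1)

# with a centered predictor, the OLS intercept equals the mean of y
print(intercept, y.mean())
```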

Maximum Likelihood Estimation

Maximum Likelihood Estimation

\[ p(y | x, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y- \mu)^2}{2\sigma^2}} \]

In Linear Regression, the likelihood function is the normal distribution with parameters \(\theta = \left( \mu, \sigma\right)\)

We want to choose the values of \(\theta\) that maximize the likelihood of the data, \(\left(x,y\right )\).

Maximum Likelihood Estimation

For a single data point, the value of the likelihood function \(\mathcal{L}\left( \theta | y_i \right)\) is:

\[ \mathcal{L} \left( \theta | y_i \right) = p(y_i | x_i, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i- \mu_i)^2}{2\sigma^2}} \]

where \(\mu_i = g^{-1}(\mathbf{x}_i\beta)\).

Maximum Likelihood Estimation

Because of our assumption that data points are independent (the first i in i.i.d), the likelihood value for all data points is simply the product of their individual likelihood values, since \(p(A,B) = p(A)*p(B) \text{ iff } A \mathrel{\unicode{x2AEB}} B\).
\[ \mathcal{L}\left(\theta | \mathbf{y} \right) = p(\mathbf{y} | \mathbf{x}, \theta) = \prod_{i=1}^n p(y_i | x_i, \theta) = \prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i- \mu_i)^2}{2\sigma^2}} \]

Maximum Likelihood Estimation

The higher the likelihood of our data, the more evidence that a particular \(\theta\) is a good fit for the data.

Maximum Likelihood Estimation

\[ \text{arg max}_{\theta} \left[ \prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i- \mu_i)^2}{2\sigma^2}} \right] \]

to maximize, we:

  1. take the partial derivatives of \(\prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i- \mu_i)^2}{2\sigma^2}}\) w.r.t. each element of \(\theta\)
  2. set each \(\frac{\partial}{\partial \theta_i} = 0\)
  3. solve (analytically, Expectation Maximization, Gradient Descent…) for \(\theta\)


But…

Maximum Likelihood Estimation

…taking the derivative of a product of many terms is messy, so we use the log likelihood instead.

\[ \ell\left(\theta | \mathbf{y} \right) = \log\left(\mathcal{L}\left(\theta | \mathbf{y} \right)\right) = \\ -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log (\sigma^2) -\frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \mu_i)^2 \]

Note: \(\log()\) is a monotonically increasing function, so choosing \(\theta\) that maximizes \(\ell\left(\theta | \mathbf{y} \right)\) will also maximize \(\mathcal{L}\left(\theta | \mathbf{y} \right)\)
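A quick numerical sanity check (Python/NumPy sketch with simulated data; the helper name is invented) that the log of the product likelihood equals the sum of log densities, and matches the closed-form log likelihood:

```python
import numpy as np

def normal_pdf(y, mu, sigma):
    # density of the normal distribution
    return np.exp(-((y - mu) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

rng = np.random.default_rng(2)
mu, sigma = 1.0, 2.0
y = rng.normal(mu, sigma, size=20)
n = len(y)

# likelihood as a product vs. log-likelihood as a sum
L = np.prod(normal_pdf(y, mu, sigma))
ll = np.sum(np.log(normal_pdf(y, mu, sigma)))

# closed-form log-likelihood
ll_closed = (-n / 2 * np.log(2 * np.pi) - n / 2 * np.log(sigma**2)
             - np.sum((y - mu) ** 2) / (2 * sigma**2))
print(np.log(L), ll, ll_closed)  # all three agree
```

With larger \(n\) the raw product underflows to 0 in floating point, which is a second practical reason to work on the log scale.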

Maximum Likelihood Estimation

If we assume normally distributed errors, our likelihood function for a single data point is:

\[ \mathcal{L} \left( \theta | y_i \right) = p(y_i | x_i, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i- \mu_i)^2}{2\sigma^2}} \]


but what if we used a different distribution…?

Generalized

\[ \mathbb{E}\left (y | \mathbf{X}\right ) = \mu = g^{-1}\left (\mathbf{X}\beta \right ) \\ y | \mathbf{x} \sim \pi\left (\mu,...\right ) \]

  • link function: \(g()\)

  • likelihood function: \(\pi()\)

Linear Models

\[ \mathbf{X}\beta \]

“linear in the parameters”

Logistic Regression as a GLM

\[ \mathbb{E}\left( y | \mathbf{X} \right) = g^{-1}\left(\mathbf{X}\beta \right) \\ g\left(p \right) = \log\left( \frac{p}{1-p} \right); g^{-1}\left(x \right) = \frac{1}{1+e^{-x}} \\ y | \mathbf{X} \sim \text{Bernoulli}\left( g^{-1}\left(\mathbf{X}\beta \right) \right) \]

  • Logit link function

  • Bernoulli likelihood

Maximum Likelihood Estimation

\[ \mathcal{L}\left( \theta | \mathbf{y}\right) = \prod_{i = 1}^n p(x_i)^{y_i} \left[1- p(x_i) \right]^{1-y_i} \]

Notice, the likelihood function uses a Bernoulli distribution rather than a Normal distribution!
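This Bernoulli likelihood can be evaluated directly; a Python/NumPy sketch with simulated data (the helper names are invented for illustration):

```python
import numpy as np

def inv_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

# simulate binary outcomes from known coefficients
rng = np.random.default_rng(3)
x = rng.normal(size=300)
b0, b1 = -0.5, 1.2
y = rng.binomial(1, inv_logit(b0 + b1 * x))

def bernoulli_loglik(beta, x, y):
    p = inv_logit(beta[0] + beta[1] * x)
    # log of prod p^y * (1 - p)^(1 - y): note the product, not a sum
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# coefficients near the truth should score higher than a bad guess
print(bernoulli_loglik([-0.5, 1.2], x, y), bernoulli_loglik([5.0, -5.0], x, y))
```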

Choosing a Likelihood Function

how are errors distributed? what are the actual values of \(y\)?

  • linear regression: values are continuous, distributed normally

  • logistic regression: values are binary, distributed Bernoulli

Robust Student t Regression

“fat tailed distributions, you make the rockin’ world go round”

❓ if you use a student t distribution as your likelihood function, how does the likelihood of extreme values change? What impact would that have on your regression?

Robust Student t Regression

  • link function: identity \(g(x) = x\), \(g^{-1}(x) = x\)

  • likelihood function: student-t with \(\nu\) degrees of freedom

\[ \mathbb{E}\left( y | \mathbf{X} \right) = g^{-1}\left(\mathbf{X}\beta \right) \\ g\left(\mathbf{X} \right) = \mathbf{X} ; g^{-1}\left(\mathbf{X} \right) = \mathbf{X} \\ p\left (y | \mathbf{X} \right) \sim t\left( \mathbf{X}\beta, \nu \right) \]
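To see why the heavier tails matter, a small Monte Carlo sketch (Python/NumPy, illustrative): the \(t\) distribution with \(\nu = 3\) makes observations beyond 3 standard units far more likely than the normal does, so extreme points are less surprising and pull the fit around less.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 1_000_000
z = rng.standard_normal(N)
tt = rng.standard_t(df=3, size=N)

# fraction of draws farther than 3 from the center
tail_normal = np.mean(np.abs(z) > 3)
tail_t = np.mean(np.abs(tt) > 3)
print(tail_normal, tail_t)  # the t(3) tail is roughly 20x heavier
```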

Robust Student t Regression

# packages used below: dplyr (for %>%) and purrr (for set_names())
library(dplyr)
library(purrr)

# sigma (covariance matrix)
s <- matrix(c(1, .6, 
              .6, 1), 
             nrow = 2, ncol = 2)
# mu
m <- c(0, 0)

# sim
n <- 50
set.seed(540)

d <- MASS::mvrnorm(n = n, mu = m, Sigma = s) %>%
  data.frame() %>%
  set_names(c("y", "x"))

# create outliers
o <- d
o[c(1:2), 1] <- c(6,5)
o[c(1:2), 2] <- c(-2, -1.5)

modified from: https://solomonkurz.netlify.app/blog/2019-02-02-robust-linear-regression-with-student-s-t-distribution/

Robust Student t Regression

Robust Student t Regression

library(brms)
biased_lm <-  brm(y ~ x, family = gaussian(), data = o)
biased_lm
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y ~ x 
   Data: o (Number of observations: 50) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.26      0.20    -0.15     0.65 1.00     4083     2973
x             0.12      0.21    -0.31     0.53 1.00     3632     2735

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     1.48      0.15     1.21     1.81 1.00     3696     2910

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Robust Student t Regression

unbiased_lm <-  brm(y ~ x, family = student(), data = o)
unbiased_lm
 Family: student 
  Links: mu = identity; sigma = identity; nu = identity 
Formula: y ~ x 
   Data: o (Number of observations: 50) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.10      0.14    -0.17     0.38 1.00     4000     2454
x             0.57      0.16     0.25     0.87 1.00     3638     2731

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.77      0.13     0.55     1.06 1.00     3550     2925
nu        3.57      1.63     1.64     7.48 1.00     3196     2711

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Beta Regression

Assume \(y \in (0,1)\) (e.g., a proportion); what would be a problem with using linear regression to predict \(y\)?

Note: technically, beta regression is not a GLM. But if it walks like a 🦆 and quacks like a 🦆, surely…

Beta Regression

Beta Regression

  • link function: logit \(g(x) = \log \left( \frac{x}{1-x}\right)\), \(g^{-1}(x) = \frac{1}{1 + e^{-x}}\)

  • likelihood function: beta distribution with mean \(\mu\) and precision \(\kappa\)

\[ \mathbb{E}\left( y | \mathbf{X} \right) = g^{-1}\left(\mathbf{X}\beta \right) \\ g\left(p \right) = \log\left( \frac{p}{1-p} \right); g^{-1}\left(x \right) = \frac{1}{1+e^{-x}} \\ y | \mathbf{X} \sim \text{Beta}\left( g^{-1}\left(\mathbf{X}\beta \right), \kappa \right) \]
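Under the mean–precision parameterization of the beta distribution (an assumption here; brms uses a similar \(\mu\)/\(\phi\) form), the shape parameters are \(a = \mu\kappa\) and \(b = (1-\mu)\kappa\). A quick sketch:

```python
def beta_shapes(mu, kappa):
    # mean-precision parameterization: a = mu * kappa, b = (1 - mu) * kappa
    return mu * kappa, (1.0 - mu) * kappa

mu, kappa = 0.3, 10.0
a, b = beta_shapes(mu, kappa)

# implied mean and variance of Beta(a, b)
mean = a / (a + b)                            # recovers mu
var = a * b / ((a + b) ** 2 * (a + b + 1))    # equals mu * (1 - mu) / (kappa + 1)
print(a, b, mean, var)
```

Larger \(\kappa\) shrinks the variance around the same mean, which is why it is called a precision parameter.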

Poisson Regression

Poisson Regression works with count data.

\[ P(y = k) = \frac{\gamma^ke^{-\gamma}}{k!} \]

Poisson Regression

  • link function: \(g(x) = \log(x)\), \(g^{-1}(x) = e^x\)

  • likelihood function: Poisson \(Pois(\gamma)\)

Note: for the Poisson Distribution, \(\gamma = \mu = \sigma^2\)
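A quick simulation check of that note (Python/NumPy sketch): for a Poisson sample, the sample mean and sample variance should both be close to \(\gamma\).

```python
import numpy as np

rng = np.random.default_rng(4)
gamma = 4.0
k = rng.poisson(gamma, size=100_000)

# for the Poisson, E[k] = Var[k] = gamma (equidispersion)
print(k.mean(), k.var())  # both close to 4
```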

Poisson Regression

Negative Binomial Regression

Negative Binomial Regression

Negative Binomial Regression

  • link function: \(g(x) = \log(x)\), \(g^{-1}(x) = e^x\)

  • likelihood function: negative binomial \(NB(\mu, \alpha)\)

    • When \(\alpha\) → 0: approaches Poisson (equidispersion)

    • When \(\alpha\) > 0: overdispersion (variance > mean)
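A sketch of the overdispersion relationship (Python/NumPy, assuming the \(NB(\mu, \alpha)\) parameterization above, under which \(Var = \mu + \alpha\mu^2\)). NumPy's sampler uses the \((n, p)\) form, so we convert first:

```python
import numpy as np

rng = np.random.default_rng(5)
mu, alpha = 4.0, 0.5

# convert (mu, alpha) to NumPy's (n, p) parameterization:
# n = 1 / alpha, p = n / (n + mu)
n = 1.0 / alpha
p = n / (n + mu)
k = rng.negative_binomial(n, p, size=200_000)

# variance should be about mu + alpha * mu^2 = 4 + 0.5 * 16 = 12 > mean
print(k.mean(), k.var())
```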